Dependent Bigram Identification

Author

Ted Pedersen

Abstract

Dependent bigrams are pairs of consecutive words that occur together in a text more often than would be expected purely by chance. Identifying such bigrams matters because they provide valuable clues for machine translation, word sense disambiguation, and information retrieval. A variety of significance tests have been proposed to identify these interesting lexical pairs (e.g., Church et al., 1991; Dunning, 1993; Pedersen et al., 1996). In this poster I present a new statistic, minimum sensitivity, that is simple to compute and is free from the underlying distributional assumptions commonly made by significance tests.

The challenge in identifying dependent bigrams is that most are relatively rare regardless of the amount of text being considered. This follows from the distributional tendencies of individual bigrams as described by Zipf's Law. If the frequencies of the bigrams in a text are ordered from most to least frequent, $(f_1, f_2, \ldots, f_n)$, these frequencies roughly obey $f_i \propto 1/i$.

Consider the following example from a 1,300,000 word sample of the ACL/DCI Wall Street Journal Corpus. The contingency table for "oil" and "industry" contains these frequency counts: "oil industry" occurs 17 times, "oil" occurs without "industry" 240 times, "industry" occurs without "oil" 1,001 times, and bigrams containing neither "oil" as the first word nor "industry" as the second occur 1,298,742 times. This distribution is sparse and skewed and thus violates a central assumption implicit in significance testing of contingency tables (Read & Cressie, 1988).
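To make the sparsity in the oil/industry example concrete, the following minimal Python sketch computes the count that would be expected for the bigram if the two words occurred independently. The variable names are illustrative and do not come from the poster, and the independence model used here is the standard one for 2x2 contingency tables, which is an assumption about the setting rather than a reproduction of the author's own computation.

```python
# Observed counts from the "oil industry" example (1,300,000-word WSJ sample).
n11 = 17          # "oil" followed by "industry"
n12 = 240         # "oil" followed by some other word
n21 = 1001        # some other word followed by "industry"
n22 = 1_298_742   # neither "oil" first nor "industry" second

total = n11 + n12 + n21 + n22

# Marginal totals: all bigrams starting with "oil", all ending with "industry".
row_oil = n11 + n12
col_industry = n11 + n21

# Expected count of "oil industry" if the two words were independent.
expected_n11 = row_oil * col_industry / total

print(f"total bigrams:      {total}")
print(f"observed n11:       {n11}")
print(f"expected n11 (ind): {expected_n11:.3f}")
```

Under these assumptions the expected count is about 0.2, against an observed count of 17. An expected cell value this small falls well below the commonly cited rule of thumb of roughly 5 per cell for the large-sample approximations behind chi-square-style tests, which illustrates why the table is described as sparse and skewed.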

Related resources

Language identification incorporating lexical information

In this paper we explore the use of lexical information for language identification (LID). Our reference LID system uses language-dependent acoustic phone models and phone-based bigram language models. For each language, lexical information is introduced by augmenting the phone vocabulary with the N most frequent words in the training data. Combined phone and word bigram models are used to prov...

Double bigram-decoding in phonotactic language identification

In this paper a phonotactic language identification system that employs a multilingual phone-recognizer with multiple language-dependent grammars to tokenize the spoken signal into several phone-streams is described. For each stream an independent set of language models is used to compute the language scores that are subsequently processed by two classification stages. Thus, the system acquires i...

A Machine Learning Approach for the Identification of Bengali Noun-Noun Compound Multiword Expressions

This paper presents a machine learning approach for identification of Bengali multiword expressions (MWE) which are bigram nominal compounds. Our proposed approach has two steps: (1) candidate extraction using chunk information and various heuristic rules and (2) training the machine learning algorithm called Random Forest to classify the candidates into two groups: bigram nominal compound MWE ...

Chinese Unknown Word Identification Based on Local Bigram Model with Integrally Smoothing Assumption

The paper presents a Chinese unknown word identification system based on a local bigram model. Generally, our word segmentation system employs a statistical-based unigram model. But to identify those unknown words, we take advantage of their contextual information and apply a bigram model locally. To explain this local approximation, we make an “integrally smoothing assumption”. As a simplifica...

Chinese Unknown Word Identification Based on Local Bigram Model

This paper presents a Chinese unknown word identification system based on a local bigram model. Generally, our word segmentation system employs a statistical-based unigram model. But to identify those unknown words, we take advantage of their contextual information and apply a bigram model locally. By adjusting the value of interpolation which is derived from a smoothing method, we combine thes...

Type and token bigram frequencies for two-through nine-letter words and the prediction of anagram difficulty.

Recent research on anagram solution has produced two original findings. First, it has shown that a new bigram frequency measure called top rank, which is based on a comparison of summed bigram frequencies, is an important predictor of anagram difficulty. Second, it has suggested that the measures from a type count are better than token measures at predicting anagram difficulty. Testing these hy...

Publication date: 1998
